11 research outputs found

    VIDA: a virus database system for the organization of animal virus genome open reading frames

    VIDA is a new virus database that organizes open reading frames (ORFs) from partial and complete genomic sequences of animal viruses. Currently VIDA includes all sequences from GenBank for Herpesviridae, Coronaviridae and Arteriviridae. The ORFs are organized into homologous protein families, which are identified on the basis of sequence similarity relationships. Conserved sequence regions of potential functional importance are identified and can be retrieved as sequence alignments. We use a controlled taxonomical and functional classification for all the proteins and protein families in the database. When available, protein structures that are related to the families have also been included. The database is available for online search and sequence information retrieval at http://www.biochem.ucl.ac.uk/bsm/virus-database/VIDA.html

    Improving the performance of DomainDiscovery of protein domain boundary assignment using inter-domain linker index

    BACKGROUND: Knowledge of protein domain boundaries is critical for the characterisation and understanding of protein function. The ability to identify domains without knowledge of the structure – by using sequence information only – is an essential step in many types of protein analyses. In the present study, we demonstrate that the performance of DomainDiscovery is improved significantly by including the inter-domain linker index value for domain identification from sequence-based information. Improved DomainDiscovery uses a Support Vector Machine (SVM) approach and a unique training dataset built on the principle of consensus among experts in defining domains in protein structure. The SVM was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and inter-domain linker index to detect possible domain boundaries for a target sequence. RESULTS: Improved DomainDiscovery is compared with other methods by benchmarking against a structurally non-redundant dataset and also CASP5 targets. Improved DomainDiscovery achieves 70% accuracy for domain boundary identification in multi-domain proteins. CONCLUSION: Improved DomainDiscovery compares favourably with other methods and excels in the identification of domain boundaries for multi-domain proteins as a result of introducing a support vector machine with the benchmark_2 dataset
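
    The abstract lists the per-residue inputs (PSSM, secondary structure, solvent accessibility, inter-domain linker index) but not how they are encoded for the SVM. A common choice for residue-level predictors is a sliding window, sketched below in Python. The window encoding, the function name `window_features` and the `half_width` parameter are assumptions for illustration and are not taken from the paper.

```python
def window_features(per_residue, half_width=2):
    """Concatenate per-residue feature vectors over a centred window.

    per_residue: list of equal-length feature lists, one per residue
                 (hypothetically: PSSM row + structure + accessibility + linker index).
    Positions that fall outside the sequence are zero-padded.
    """
    n = len(per_residue)
    dim = len(per_residue[0])
    pad = [0.0] * dim
    encoded = []
    for i in range(n):
        row = []
        for j in range(i - half_width, i + half_width + 1):
            row.extend(per_residue[j] if 0 <= j < n else pad)
        encoded.append(row)
    return encoded


# Toy input: three residues with a single made-up feature each.
toy = [[1.0], [2.0], [3.0]]
encoded = window_features(toy, half_width=1)
```

    Each encoded row could then be passed to any classifier as the feature vector for the central residue.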

    Improved general regression network for protein domain boundary prediction

    Background: Protein domains present some of the most useful information that can be used to understand protein structure and functions. Recent research on protein domain boundary prediction has been mainly based on widely known machine learning techniques, such as Artificial Neural Networks and Support Vector Machines. In this study, we propose a new machine learning model (IGRN) that can achieve accurate and reliable classification, with significantly reduced computations. The IGRN was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and inter-domain linker index to detect possible domain boundaries for a target sequence. Results: The proposed model achieved an average prediction accuracy of 67% on the Benchmark_2 dataset for domain boundary identification in multi-domain proteins and showed superior predictive performance and generalisation ability among the most widely used neural network models. With the CASP7 benchmark dataset, it also demonstrated comparable performance to existing domain boundary predictors such as DOMpro, DomPred, DomSSEA, DomCut and DomainDiscovery, with 70.10% prediction accuracy. Conclusion: The performance of the proposed model compares favourably with that of other existing machine learning based methods as well as widely known domain boundary predictors on two benchmark datasets, and it excels in the identification of domain boundaries in terms of model bias, generalisation and computational requirements. © 2008 Yoo et al; licensee BioMed Central Ltd
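
    For orientation, the sketch below shows a plain general regression network in Python: a prediction is a Gaussian-kernel-weighted average of the training targets. It is the textbook form only and does not reproduce the improvements that define IGRN. The function name `grnn_predict`, the `sigma` smoothing parameter and the toy data are assumptions for illustration.

```python
import math


def grnn_predict(x, train_x, train_y, sigma=0.5):
    """Predict a target for x as a kernel-weighted mean of training targets."""
    weights = []
    for xi in train_x:
        dist2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-dist2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel underflowed: fall back to the unweighted mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Toy data: two made-up training points labelled 0 (non-boundary) and 1 (boundary).
train_x = [[0.0, 0.0], [1.0, 1.0]]
train_y = [0.0, 1.0]
midpoint_score = grnn_predict([0.5, 0.5], train_x, train_y)
near_zero_score = grnn_predict([0.0, 0.0], train_x, train_y)
```

    There is no iterative training step in this form: the stored examples are the model, and `sigma` controls how sharply nearby examples dominate the prediction.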

    Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint

    BACKGROUND: Structural genomics initiatives were established with the aim of solving protein structures on a large scale. For many initiatives, such as the Protein Structure Initiative (PSI), the primary aim of target selection is to structurally characterise protein families which, so far, lack a structural representative. It is therefore of considerable interest to gain insights into the number and distribution of these families, and what efforts may be required to achieve a comprehensive structural coverage across all protein families. RESULTS: In this analysis we have derived a comprehensive domain annotation of the genomes using CATH, Pfam-A and Newfam domain families. We consider what proportions of structurally uncharacterised families are accessible to high-throughput structural genomics pipelines, specifically those targeting families containing multiple prokaryotic orthologues. In measuring the domain coverage of the genomes, we show the benefits of selecting targets from structurally uncharacterised domain families while also pursuing additional targets from large structurally characterised protein superfamilies. CONCLUSION: This work suggests that such a combined approach to target selection is essential if structural genomics is to achieve a comprehensive structural coverage of the genomes, leading to greater insights into structure and the mechanisms that underlie protein evolution

    Defining Signatures of Arm-Wise Copy Number Change and Their Associated Drivers in Kidney Cancers.

    Using pan-cancer data from The Cancer Genome Atlas (TCGA), we investigated how patterns in copy number alterations in cancer cells vary both by tissue type and as a function of genetic alteration. We find that patterns in both chromosomal ploidy and individual arm copy number are dependent on tumour type. We highlight, for example, the significant losses in chromosome arm 3p and the gain of ploidy in 5q in kidney clear cell renal cell carcinoma tissue samples. We find that specific gene mutations are associated with genome-wide copy number changes. Using signatures derived from non-negative factorisation, we also find gene mutations that are associated with particular patterns of ploidy change. Finally, utilising a set of machine learning classifiers, we successfully predicted the presence of mutated genes in a sample using arm-wise copy number patterns as features. This demonstrates that mutations in specific genes are correlated with, and may lead to, specific patterns of ploidy loss and gain across chromosome arms. Using these same classifiers, we highlight which arms are most predictive of commonly mutated genes in kidney renal clear cell carcinoma (KIRC)
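
    The abstract does not say which factorisation algorithm was used. As a sketch only, the Python below assumes standard non-negative matrix factorisation with multiplicative updates, applied to a small made-up samples-by-arms copy number matrix. The helper names (`matmul`, `transpose`, `frob_error`, `nmf`), the rank and the toy values are all assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (samples x arms) into W (samples x rank) and H (rank x arms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Toy matrix: rows are samples, columns are chromosome arms (made-up copy numbers).
v = [
    [1.0, 3.0, 2.0, 2.0],
    [1.0, 3.0, 2.0, 2.0],
    [2.0, 2.0, 1.0, 3.0],
    [2.0, 2.0, 1.0, 3.0],
    [1.5, 2.5, 1.5, 2.5],
]
w, h = nmf(v, rank=2)
```

    In this reading, each row of `h` is a candidate signature across arms and each row of `w` gives a sample's exposure to those signatures.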

    Biologische Datenbanken


    Aneuploidy tolerance caused by BRG1 loss allows chromosome gains and recovery of fitness.

    Aneuploidy results in decreased cellular fitness in many species and model systems. However, aneuploidy is commonly found in cancer cells and often correlates with aggressive growth, suggesting that the impact of aneuploidy on cellular fitness is context dependent. The BRG1 (SMARCA4) subunit of the SWI/SNF chromatin remodelling complex is frequently lost in cancer. Here, we use a chromosomally stable cell line to test the effect of BRG1 loss on the evolution of aneuploidy. BRG1 deletion leads to an initial loss of fitness in this cell line that improves over time. Notably, we find increased tolerance to aneuploidy immediately upon loss of BRG1, and the fitness recovery over time correlates with chromosome gain. These data show that BRG1 loss creates an environment where karyotype changes can be explored without a fitness penalty. At least in some genetic backgrounds, therefore, BRG1 loss can affect the progression of tumourigenesis through tolerance of aneuploidy

    Domains mediate protein-protein interactions and nucleate protein assemblies.

    Cell physiology is governed by an intricate mesh of physical and functional links among proteins, nucleic acids and other metabolites. The recent information flood coming from large-scale genomic and proteomic approaches allows us to foresee the possibility of compiling an exhaustive list of the molecules present within a cell, enriched with quantitative information on concentration and cellular localization. Moreover, several high-throughput experimental and computational techniques have been devised to map all the protein interactions occurring in a living cell. So far, such maps have been drawn as graphs where nodes represent proteins and edges represent interactions. However, this representation does not take into account the intrinsically modular nature of proteins and thus fails to provide an effective description of the determinants of binding. Since proteins are composed of domains that often confer on proteins their binding capabilities, a more informative description of the interaction network would detail, for each pair of interacting proteins in the network, which domains mediate the binding. Understanding how protein domains combine to mediate protein interactions would allow one to add important features to the protein interaction network, making it possible to discriminate between simultaneously occurring and mutually exclusive interactions. This objective can be achieved by experimentally characterizing domain recognition specificity or by analyzing the frequency of co-occurring domains in proteins that do interact. Such approaches provide insights into the topology of complexes with unknown three-dimensional structure, thus opening the prospect of adopting a more rational strategy in developing drugs designed to selectively target specific protein interactions
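
    The frequency analysis mentioned above can be illustrated with a small Python sketch that counts, for every interacting protein pair, which domain pairs span the two partners. The protein identifiers, domain assignments and the function name `domain_pair_counts` are made up for illustration. A real analysis would also need a background model to judge whether a count is higher than expected by chance.

```python
from collections import Counter
from itertools import product


def domain_pair_counts(interactions, domains):
    """Count how often each unordered domain pair spans an interacting protein pair."""
    counts = Counter()
    for a, b in interactions:
        # Use a set so a domain pair is counted once per interaction.
        pairs = {tuple(sorted(p)) for p in product(domains[a], domains[b])}
        counts.update(pairs)
    return counts


# Toy data: hypothetical proteins with hypothetical domain annotations.
domains = {
    "P1": ["SH3", "kinase"],
    "P2": ["proline_rich"],
    "P3": ["SH3"],
    "P4": ["proline_rich", "PDZ"],
}
interactions = [("P1", "P2"), ("P3", "P4"), ("P1", "P4")]
counts = domain_pair_counts(interactions, domains)
```

    Domain pairs that recur across many interactions are candidates for mediating the binding, which is the per-edge annotation the abstract argues a protein interaction graph lacks.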